Processing Homonyms in the Kana-to-Kanji Conversion


  • Masahito Takahashi
  • Tsuyoshi Shinchu
  • Kenji Yoshimura
  • Kosho Shudo

This paper proposes two new methods to identify the correct meaning of Japanese homonyms in text, based on the noun-verb co-occurrence in a sentence, which can be obtained easily from corpora. The first method uses the near co-occurrence data sets, which are constructed from the above co-occurrence relation, to select the most feasible word among homonyms in the scope of a sentence. The second uses the far co-occurrence data sets, which are constructed dynamically from the near co-occurrence data sets in the course of processing input sentences, to select the most feasible word among homonyms in the scope of a sequence of sentences. An experiment in kana-to-kanji (phonogram-to-ideograph) conversion has shown that the first method carries out the conversion at an accuracy rate of 79.6% per word. This accuracy rate is 7.4% higher than that of the ordinary method based on word occurrence frequency.

1 Introduction

Processing homonyms, i.e. identifying the correct meaning of homonyms in text, is one of the most important phases of kana-to-kanji conversion, currently the most popular method for inputting Japanese characters into a computer. Recently, several new methods for processing homonyms, based on neural networks (Kobayashi, 1992) or the co-occurrence relation of words (Yamamoto, 1992), have been proposed. These methods apply to the co-occurrence relation of words not only in a sentence but also in a sequence of sentences. It appears impracticable to prepare a neural network for co-occurrence data large enough to handle 50,000 to 100,000 Japanese words.
In this paper, we propose two new methods for processing Japanese homonyms based on the co-occurrence relation between a noun and a verb in a sentence. We have defined two co-occurrence data sets. One is a set of nouns accompanied by a case marking particle, each element of which has a set of co-occurring verbs in a sentence. The other is a set of verbs accompanied by a case marking particle, each element of which has a set of co-occurring nouns in a sentence. We call these two co-occurrence data sets near co-occurrence data sets. We then apply the data sets to the processing of homonyms.

Two strategies are used to approach the problem. The first uses the near co-occurrence data sets to select the most feasible word among homonyms in the scope of a sentence. The aim is to evaluate the possible existence of a near co-occurrence relation, that is, a co-occurrence relation between a noun and a verb within a sentence. The second evaluates the possible existence of a far co-occurrence relation, that is, a co-occurrence relation among words in different sentences. This is achieved by constructing far co-occurrence data sets from near co-occurrence data sets in the course of processing input sentences.

2 Co-occurrence data sets

This section defines the near co-occurrence data sets. The first near co-occurrence data set is the set $\Sigma_{N_{near}}$, each element $n$ of which is a triplet consisting of a noun, a case marking particle, and a set of verbs which co-occur with that noun and particle pair in a sentence, as follows:

$$n = (noun,\ particle,\ \{(v_1, k_1), (v_2, k_2), \ldots\})$$

In this description, $particle$ is a Japanese case marking particle, such as が (nominative case), を (accusative case), or に (dative case); $v_i$ ($i = 1, 2, \ldots$) is a verb; and $k_i$ ($i = 1, 2, \ldots$) is the frequency of occurrence of the combination of $noun$, $particle$ and $v_i$, which is determined in the course of constructing $\Sigma_{N_{near}}$ from corpora. The following are examples of the elements of $\Sigma_{N_{near}}$:

(雨 (rain), が (nominative case), {(降る (fall), 10), (止む (stop), 3), ...})
(雨 (rain), を (accusative case), {([illegible] (take precautions), 3), ...})
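The triplet structure described above can be illustrated with a minimal Python sketch. It assumes that (noun, particle, verb) triples have already been extracted from a corpus; the sample triples, the function names `build_noun_side`, `build_verb_side` and `select_homonym`, and the rule of picking the candidate with the highest recorded frequency are all hypothetical illustrations, not the procedure used in this paper.

```python
from collections import defaultdict

# Hypothetical (noun, case particle, verb) observations, assumed to have
# been extracted from a corpus. The words and counts are illustrative only.
observations = [
    ("雨", "が", "降る"),    # rain (nominative) fall
    ("雨", "が", "降る"),
    ("雨", "が", "止む"),    # rain (nominative) stop
    ("飴", "を", "舐める"),  # candy (accusative) lick
]


def build_noun_side(triples):
    """Map (noun, particle) to {verb: frequency}, one entry per triplet n."""
    data = defaultdict(lambda: defaultdict(int))
    for noun, particle, verb in triples:
        data[(noun, particle)][verb] += 1
    return {key: dict(verbs) for key, verbs in data.items()}


def build_verb_side(triples):
    """Map (verb, particle) to {noun: frequency}, the verb-keyed counterpart."""
    data = defaultdict(lambda: defaultdict(int))
    for noun, particle, verb in triples:
        data[(verb, particle)][noun] += 1
    return {key: dict(nouns) for key, nouns in data.items()}


def select_homonym(candidates, particle, verb, noun_side):
    """Pick the candidate noun with the highest recorded frequency.

    Assumption: a plain frequency comparison stands in for the
    feasibility evaluation, which this section does not specify.
    """
    def frequency(noun):
        return noun_side.get((noun, particle), {}).get(verb, 0)

    return max(candidates, key=frequency)


noun_side = build_noun_side(observations)
verb_side = build_verb_side(observations)

# 雨 (rain) and 飴 (candy) are both read "ame"; the verb 降る (fall)
# has only been observed with 雨 in the sample data.
best = select_homonym(["飴", "雨"], "が", "降る", noun_side)
print(best)
```

Keying the dictionary on the (noun, particle) pair keeps the same noun with different case particles as separate elements, as in the two 雨 examples above.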

